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1  Introduction 


The  “Expect ation-Maximization”  (EM)  algorithm  is  a 
general  technique  for  maximum  likelihood  (ML)  or  max¬ 
imum  a  posteriori  (MAP)  estimation.  The  recent  em¬ 
phasis  in  the  neural  network  literature  on  probabilistic 
models  has  led  to  increased  interest  in  EM  as  a  possible 
alternative  to  gradient-based  methods  for  optimization. 
EM  has  been  used  for  variations  on  the  traditional  theme 
of  Gaussian  mixture  modeling  (Ghahramani  &  Jordan, 
1994;  Nowlan,  1991;  Xu  &  Jordan,  1993a,  b;  Tresp,  Ah¬ 
mad  &  Neuneier,  1994;  Xu,  Jordan  &  Hinton,  1994)  and 
has  also  been  used  for  novel  chain-structured  and  tree- 
structured  architectures  (Bengio  &  Frasconi,  1995;  Jor¬ 
dan  &  Jacobs,  1994).  The  empirical  results  reported  in 
these  papers  suggest  that  EM  has  considerable  promise 
as  an  optimization  method  for  such  architectures.  More¬ 
over,  new  theoretical  results  have  been  obtained  that 
link  EM  to  other  topics  in  learning  theory  (Amari,  1994; 
Jordan  k  Xu,  1993;  Neal  k  Hinton,  1993;  Xu  k  Jordan, 
1993c;  Yuille,  Stolorz  k  Utans,  1994). 

Despite  these  developments,  there  are  grounds  for 
caution  about  the  promise  of  the  EM  algorithm.  One 
reason  for  caution  comes  from  consideration  of  theoret¬ 
ical  convergence  rates,  which  show  that  EM  is  a  first 
order  algorithm.^  More  precisely,  there  are  two  key  re¬ 
sults  available  in  the  statistical  literature  on  the  con¬ 
vergence  of  EM.  First,  it  has  been  established  that  un¬ 
der  mild  conditions  EM  is  guaranteed  to  converge  to 
a  local  maximum  of  the  log  likelihood  I  (Boyles,  1983; 
Dempster,  Laird  k  Rubin,  1977;  Redner  k  Walker, 
1984;  Wu,  1983).  (Indeed  the  convergence  is  monotonic: 
/(0(A:+i))  >  /(0(^)),  where  is  the  value  of  the  pa¬ 
rameter  vector  ©  at  iteration  k.)  Second,  considering 
EM  as  a  mapping  ==  M(0^*^)  with  fixed  point 

0*  =  M(0"),  we  have  gMQD.(0W-0*) 

when  0(^+^)  is  near  0*,  and  thus 


ll0(*+i)_0*ll<ll^^m||.||0w_e*ll, 


with 


dM{e* 
'  50* 


#0 


almost  surely.  That  is,  EM  is  a  first  order  algorithm. 

The  first-order  convergence  of  EM  has  been  cited  in 
the  statistical  literature  as  a  major  drawback.  Red¬ 
ner  and  Walker  (1984),  in  a  widely-cited  article,  argued 
that  superlinear  (quasi-Newton,  method  of  scoring)  and 
second-order  (Newton)  methods  should  generally  be  pre¬ 
ferred  to  EM.  They  reported  empirical  results  demon¬ 
strating  the  slow  convergence  of  EM  on  a  Gaussian  mix¬ 
ture  model  problem  for  which  the  mixture  components 
were  not  well  separated.  These  results  did  not  include 
tests  of  competing  algorithms,  however.  Moreover,  even 
though  the  convergence  toward  the  “optimal”  parameter 
values  was  slow  in  these  experiments,  the  convergence  in 
likelihood  was  rapid.  Indeed,  Redner  and  Walker  ac¬ 
knowledge  that  their  results  show  that  “...  even  when 


^An  iterative  algorithm  is  said  to  have  a  local  convergence 
rate  of  order  g  >  1  if  —  0*||/||0^^^  —  0*||^  <  r  -\- 

o(||0(^)  _  0*11)  for  k  sufficiently  large. 


the  component  populations  in  a  mixture  are  poorly  sep¬ 
arated,  the  EM  algorithm  can  be  expected  to  produce 
in  a  very  small  number  of  iterations  parameter  values 
such  that  the  mixture  density  determined  by  them  re¬ 
flects  the  sample  data  very  well.”  In  the  context  of  the 
current  literature  on  learning,  in  which  the  predictive 
aspect  of  data  modeling  is  emphasized  at  the  expense  of 
the  traditional  Fisherian  statistician’s  concern  over  the 
“true”  values  of  parameters,  such  rapid  convergence  in 
likelihood  is  a  major  desideratum  of  a  learning  algorithm 
and  undercuts  the  critique  of  EM  as  a  “slow”  algorithm. 

In  the  current  paper,  we  provide  a  comparative  anal¬ 
ysis  of  EM  and  other  optimization  methods.  We  empha¬ 
size  the  comparison  between  EM  and  other  first-order 
methods  (gradient  ascent,  conjugate  gradient  methods), 
because  these  have  tended  to  be  the  methods  of  choice 
in  the  neural  network  literature.  However,  we  also  com¬ 
pare  EM  to  superlinear  and  second-order  methods.  We 
argue  that  EM  has  a  number  of  advantages,  including  its 
naturalness  at  handling  the  probabilistic  constraints  of 
mixture  problems  and  its  guarantees  of  convergence.  We 
also  provide  new  results  suggesting  that  under  appropri¬ 
ate  conditions  EM  may  in  fact  approximate  a  superlin¬ 
ear  method;  this  would  explain  some  of  the  promising 
empirical  results  that  have  been  obtained  (Jordan  &  Ja¬ 
cobs,  1994),  and  would  further  temper  the  critique  of  EM 
offered  by  Redner  and  Walker.  The  analysis  in  the  cur¬ 
rent  paper  focuses  on  unsupervised  learning;  for  related 
results  in  the  supervised  learning  domain  see  Jordan  and 
Xu  (in  press). 

The  remainder  of  the  paper  is  organized  as  follows. 
We  first  briefly  review  the  EM  algorithm  for  Gaussian 
mixtures.  The  second  section  establishes  a  connection 
between  EM  and  the  gradient  of  the  log  likelihood.  We 
then  present  a  comparative  discussion  of  the  advantages 
and  disadvantages  of  various  optimization  algorithms  in 
the  Gaussian  mixture  setting.  We  then  present  empir¬ 
ical  results  suggesting  that  EM  regularizes  the  condi¬ 
tion  number  of  the  effective  Hessian.  The  fourth  section 
presents  a  theoretical  analysis  of  this  empirical  finding. 
The  final  section  presents  our  conclusions. 


2  The  EM  algorithm  for  Gaussian 
mixtures 


We  study  the  following  probabilistic  model: 
K 

P(a:|0)  = 


(1) 


i=i 


and 

P[x\mj ,  Ej)  = 


(27r)^/2|E;f/2^ 


where  aj  >  0  and  ^f-i  otj  =  1  and  d  is  the  dimension-  — 
ality  of  the  vector  x.  The  parameter  vector  0  consists  — 
of  the  mixing  proportions  ,  the  mean  vectors  nij ,  and 
the  covariance  matrices  Ej . 

Given  K  and  given  N  independent,  identically  dis- 


3^' 

□  1 
□  ’ 


tributed  samples  ,  we  obtain  the  following  log  od® 


4 


*1,  \ 


likelihood:^ 

/(©)  =  log n  =  J]logF(^^')|e),  (2) 


N 


N 


t=l 


t=l 


which  can  be  optimized  via  the  following  iterative  algo¬ 
rithm  (see,  e.g,  Dempster,  Laird  &  Rubin,  1977): 


(>«,  _  ZL 


(3) 


-.(i+l) 


it) 


N 

Et^i  -  mf -  mf 

E'^=ihf{t)xW 


dk) 


where  the  posterior  probabilities  hj  ^  are  defined  as  fol¬ 
lows: 


3  Connection  between  EM  and 
gradient  ascent 

In  the  following  theorem  we  establish  a  relationship  be¬ 
tween  the  gradient  of  the  log  likelihood  and  the  step  in 
parameter  space  taken  by  the  EM  algorithm.  In  par¬ 
ticular  we  show  that  the  EM  step  can  be  obtained  by 
premultiplying  the  gradient  by  a  positive  definite  ma¬ 
trix.  We  provide  an  explicit  expression  for  the  matrix. 

Theorem  1  At  each  iteration  of  the  EM  algorithm  Eg. 
(3),  we  have 


(4) 

(fc  +  l)  (A;) 

mj  ^  -  my  = 

drrij 

(5) 

vec[sj^'*'^^]  —  vec[Ej^^]  = 

pik)  dl 

5vec[Ej] 

(6) 

where 

pd) 

=  ;^{diag[Q;l*'\  •  •  • , 

p{k) 

(8) 

pik) 

_  2  V'(^) 

(9) 

where  A  denotes  the  vector  of  mixing  proportions 
[ofi,  •  •  •  j  j  indexes  the  mixture  components  (j  = 

1,"  ■  ,K),  k  denotes  the  iteration  number,  \ec[B]’^  de¬ 
notes  the  vector  obtained  by  stacking  the  column  vectors 

^Although  we  focus  on  maximum  likelihood  (ML)  estima¬ 
tion  in  this  paper,  it  is  straightforward  to  apply  our  results 
to  maximum  a  posteriori  (MAP)  estimation  by  multiplying 
the  Ukehhood  by  a  prior. 


of  the  matrix  B,  and  "0  ”  denotes  the  Kronecker  prod¬ 
uct.  Moreover,  given  the  constraints  ^ 

>  Oj  is  ^  positive  definite  matrix  and  the  ma¬ 
trices  Pmj  and  are  positive  definite  with  probability 
one  for  N  sufficiently  large. 


Proof.  (1)  We  begin  by  considering  the  EM  update 
for  the  mixing  proportions  a^.  From  Eqs.  (1)  and  (2), 
we  have 

dl  ,  _  ^  •  •  • ,  P(x(0, 4*))]^’ 

^ - 


Premultiplying  by  P^\  we  obtain 
[*) 


pik)  I 

Pa  ^\A=Ai>^) 


1  ^ 


Ef=iap^P(x(0.4)) 


«  =  1 
AT 


t  =  l 


The  update  formula  for  A  in  Eq.  (3)  can  be  rewritten  as 

_4(*:  +  l)  ^  ^  1  .  .  .  ,  -  ^(0. 


Combining  the  last  two  equations  establishes  the  update 
rule  for  A  (Eq.  4).  Furthermore,  for  an  arbitrary  vec¬ 
tor  u,  we  have  NvdP^^u  =  u^diag[ai*  V  *  *  >  - 

(yd' .  By  Jensen’s  inequality  we  have 


u^diag[Q;^^\  •  ■  • ,  a' 


v(^)l 

K  J 


v(*)u2 


K 

=  E4 

;=1 

i  =  l 

Thus,  vd P^\  >  0  and  P^^  is  positive  definite  given 
the  constraints  —  1  ^ind  af^  >  0  for  all  j. 

(2)  We  now  consider  the  EM  update  for  the  means 
rrii.  It  follows  from  Eqs.  (1)  and  (2)  that 

N 


—  I 

druj 


Premultiplying  by  Pm^  yields 
dl  ,  1 


N 


druj 


(fc)  — 


_  ^(k  +  l)  (k) 

=  rrij  —  rrij. 

.N  r(fc). 


From  Eq.  (3),  we  have  >  0;  moreover, 

is  positive  definite  with  probability  one  assuming  that  N 


is  large  enough  such  that  the  matrix  is  of  full  rank.  Thus, 
it  follows  from  Eq.  (8)  that  Pm^  is  positive  definite  with 
probability  one. 

(3)  Finally,  we  prove  the  third  part  of  the  theorem. 
It  follows  from  Eqs.  (1)  and  (2)  that 


-^1 


With  this  in  mind,  we  rewrite  the  EM  update  formula 


for  as 


N 


-m)‘ 


(*)i 


=  + 


(0  _  mfY  -  Sf  ^ 


t(A;) 


where 


Er=i  hr  it) 


t=i 


{Ef  ^  -  mf  -  mf  )]^}(Ef  V 

dl 

~  aE,  'Si=sf>- 
That  is,  we  have 


y'C^+l)  _  yi^)  _| _ _ (  yW 


Utilizing  the  identity  Yec[ABC]  =  {C^  0  A)vec[5],  we 
obtain 


vecpj^’^^^]  =  vec[Ej*^^] 


+ 


sv 


(fc). 


ELhfit) 

Thus  =  =77-^(jy— (Ej*^  (g)  E^*^).  Moreover,  for  an 

2L./<=rl  ^  '' 

arbitrary  matrix  U,  we  have 


vec[U’]^(E^-*^  0  Ej*^)vec[f/] 

=  tr((Ej-^^C/f  (eJ^V)) 

=  vec  [Ej^  ^  U]'^  vecpj^  V] 

>  0, 


where  equality  holds  only  when  =  0  for  all  U . 

Equality  is  impossible,  however,  since  E^-^^  is  positive 
definite  with  probability  one  N  is  sufficiently  large.  Thus 
it  follows  from  Eq.  (9)  and  J2^=i  ^  that 

is  positive  definite  with  probability  one.  □ 


Using  the  notation 

©  =  [”^f ,  •  ■  • ,  vec[Ei]^,  •  •  • ,  vec[Efi:]^,>l^]^, 


and  P(0)  =  diag[P,„^ ,  •  •  • ,  P„^ ,  Ps, ,  •  •  • ,  Pp^ ,  P^],  we 
can  combine  the  three  updates  in  Theorem  1  into  a  single 
equation: 


e(>‘+^)  =  QW  +  P(Q(Y~\e=e('‘)^  (10) 

Under  the  conditions  of  Theorem  1,  P(0^^^)  is  a  positive 
definite  matrix  with  probability  one.  Recalling  that  for 

a  positive  definite  matrix  P,  we  have  >  0,  we 

have  the  following  corollary: 

Corollary  1  For  each  iteration  of  the  EM  algorithm 
given  by  Eq.(3),  the  search  direction  _  0(^) 

a  positive  projection  on  the  gradient  of  L 

That  is,  the  EM  algorithm  can  be  viewed  as  a  variable 
metric  gradient  ascent  algorithm  for  which  the  projection 
matrix  P(0(^^)  changes  at  each  iteration  as  a  function 
of  the  current  parameter  value  0^^^. 

Our  results  extend  earlier  results  due  to  Baum  and 
Sell  (1968).  Baum  and  Sell  studied  recursive  equations 
of  the  following  form: 


T(x©)) 


T(x©)) 

[r(2;©))i,.-.,T(x(*V] 

x^dJ/dx^ 


where  >  0,  E^i  =  li  where  J  is  a  polyno- 

miai  in  x\  having  positive  coefficients.  They  showed 
that  the  search  direction  of  this  recursive  formula,  i.e., 
T{x^^^)  -  x^^\  has  a  positive  projection  on  the  gradient 
of  of  J  with  respect  to  the  (see  also  Levinson,  Ra- 
biner  &  Sondhi,  1983).  It  can  be  shown  that  Baum  and 
SelTs  recursive  formula  implies  the  EM  update  formula 
for  >1  in  a  Gaussian  mixture.  Thus,  the  first  statement 
in  Theorem  1  is  a  special  case  of  Baum  and  SelFs  earlier 
work.  However,  Baum  and  Sell’s  theorem  is  an  existence 
theorem  and  does  not  provide  an  explicit  expression  for 
the  matrix  P^  that  transforms  the  gradient  direction 
into  the  EM  direction.  Our  theorem  provides  such  an 
e.xplicit  form  for  P^.  Moreover,  we  generalize  Baum  and 
SeiTs  results  to  handle  the  updates  for  ruj  and  Ej,  and 
we  provide  explicit  expressions  for  the  positive  definite 
transformation  matrices  Pmj  and  P^.  as  well. 

It  is  also  worthwhile  to  compare  the  EM  algorithm  to 
other  gradient-based  optimization  methods.  Newton's 
method  is  obtained  by  premultiplying  the  gradient  by 
tlie  inverse  of  the  Hessian  of  the  log  likelihood: 


0(^+1)  =  0(i)  + 


(11) 


Newton’s  method  is  the  method  of  choice  when  it  can 
be  applied,  but  the  algorithm  is  often  difficult  to  use 
in  practice.  In  particular,  the  algorithm  can  diverge 
when  the  Hessian  becomes  nearly  singular;  moreover, 
the  computational  costs  of  computing  the  inverse  Hes¬ 
sian  at  each  step  can  be  considerable.  An  alternative 


3 


is  to  approximate  the  inverse  by  a  recursively  updated 
matrix  +  t]AB^^^.  Such  a  modification 

is  called  a  quasi- Newton  method.  Conventional  quasi- 
Newton  methods  are  unconstrained  optimization  meth¬ 
ods,  however,  and  must  be  modified  in. order  to  be  used 
in  the  mixture  setting  (where  there  are  probabilistic  con¬ 
straints  on  the  parameters).  In  addition,  quasi-Newton 
methods  generally  require  that  a  one-dimensional  search 
be  performed  at  each  iteration  in  order  to  guarantee  con¬ 
vergence.  The  EM  algorithm  can  be  viewed  as  a  special 
form  of  quasi-Newton  method  in  which  the  projection 
matrix  in  Eq.  (10)  plays  the  role  of  As 

we  discuss  in  the  remainder  of  the  paper,  this  partic¬ 
ular  matrix  has  a  number  of  favorable  properties  that 
make  EM  particularly  attractive  for  optimization  in  the 
mixture  setting. 


4  Constrained  optimization  and  general 
convergence 

An  important  property  of  the  matrix  P  is  that  the  EM 
step  in  parameter  space  automatically  satisfies  the  prob¬ 
abilistic  constraints  of  the  mixture  model  in  Eq.  (1). 
The  domain  of  0  contains  two  regions  that  embody  the 

probabilistic  constraints:  Vi  =  {&  :  J2f=i  —  1} 

p2  =  {0  :  >  0,  Ej  positive  definite}.  For  the  EM 

algorithm  the  update  for  the  mixing  proportions  aj  can 
be  rewritten  as  follows: 


(Wi)  _  1  ^ 

It  is  obvious  that  the  iteration  stays  within  Vi.  Simi¬ 
larly,  the  update  for  Ey  can  be  rewritten  as: 


e(^+i) 


1  ^ 

Er=i  hf\t)  hi  Ef=i 


which  stays  within  V2  for  N  sufficiently  large. 

Whereas  EM  automatically  satisfies  the  probabilistic 
constraints  of  a  mixture  model,  other  optimization  tech¬ 
niques  generally  require  modification  to  satisfy  the  con¬ 
straints.  One  approach  is  to  modify  each  iterative  step 
to  keep  the  parameters  within  the  constrained  domain. 
A  number  of  such  techniques  have  been  developed,  in¬ 
cluding  feasible  direction  methods,  active  sets,  gradient 
projection,  reduced-gradient,  and  linearly-constrained 
quasi-Newton.  These  constrained  methods  all  incur  ex¬ 
tra  computational  costs  to  check  and  maintain  the  con¬ 
straints  and,  moreover,  the  theoretical  convergence  rates 
for  such  constrained  algorithms  need  not  be  the  same  as 
that  for  the  corresponding  unconstrained  algorithms.  A 
second  approach  is  to  transform  the  constrained  opti¬ 
mization  problem  into  an  unconstrained  problem  before 
using  the  unconstrained  method.  This  can  be  accom¬ 
plished  via  penalty  and  barrier  functions,  Lagrangian 
terms,  or  re-parameterization.  Once  again,  the  extra  al¬ 
gorithmic  machinery  renders  simple  comparisons  based 


on  unconstrained  convergence  rates  problematic.  More¬ 
over,  it  is  not  easy  to  meet  the  constraints  on  the  covari¬ 
ance  matrices  in  the  mixture  using  such  techniques. 

A  second  appealing  property  of  P(0^^^)  is  that  each 
iteration  of  EM  is  guaranteed  to  increase  the  likelihood 
(i.e.,  /(0(^+^^)  >  /(0(^))).  This  monotonic  convergence 
of  the  likelihood  is  achieved  without  step-size  parameters 
or  line  searches.  Other  gradient-based  optimization  tech¬ 
niques,  including  gradient  descent,  quasi-Newton,  and 
Newton’s  method,  do  not  provide  such  a  simple  theo¬ 
retical  guarantee,  even  assuming  that  the  constrained 
problem  has  been  transformed  into  an  unconstrained 
one.  For  gradient  ascent,  the  step  size  7?  must  be  chosen 
to  ensure  that  —  0(^”^^||/||(0‘^^)  ~  0(^”O)||  < 

||J-hr/iJ(0(^“^)))||  <  1.  This  requires  a  one-dimensional 
line  search  or  an  optimization  of  rj  at  each  iteration, 
which  requires  extra  computation  which  can  slow  down 
the  convergence.  An  alternative  is  to  fix  77  to  a  very 
small  value  which  generally  makes  ||/  -h  r7//'(0^*“^)))|| 
close  to  one  and  results  in  slow  convergence.  For  New¬ 
ton’s  method,  the  iterative  process  is  usually  required 
to  be  near  a  solution,  otherwise  the  Hessian  may  be  in¬ 
definite  and  the  iteration  may  not  converge.  Levenberg- 
Marquardt  methods  handle  the  indefinite  Hessian  ma¬ 
trix  problem;  however,  a  one-dimensional  optimization 
or  other  form  of  search  is  required  for  a  suitable  scalar 
to  be  added  to  the  diagonal  elements  of  Hessian.  Fisher 
scoring  methods  can  also  handle  the  indefinite  Hessian 
matrix  problem,  but  for  non-quadratic  nonlinear  opti¬ 
mization  Fisher  scoring  requires  a  stepsize  t]  that  obeys 
||/  -|-  r}BH{Q^^^^^))\\  <  1,  where  B  is  the  Fisher  infor¬ 
mation  matrix.  Thus,  problems  similar  to  those  of  gra¬ 
dient  ascent  arise  here  as  well.  Finally,  for  the  quasi- 
Newton  methods  or  conjugate  gradient  methods,  a  one¬ 
dimensional  line  search  is  required  at  each  iteration.  In 
summary,  all  of  these  gradient-based  methods  incur  ex¬ 
tra  computational  costs  at  each  iteration,  rendering  sim¬ 
ple  comparisons  based  on  local  convergence  rates  unre¬ 
liable. 

For  large  scale  problems,  algorithms  that  change  the 
parameters  immediately  after  each  data  point  ( “on-line 
algorithms”)  are  often  significantly  faster  in  practice 
than  batch  algorithms.  The  popularity  of  gradient  de¬ 
scent  algorithms  for  neural  networks  is  in  part  to  the 
ease  of  obtaining  on-line  variants  of  gradient  descent. 
It  is  worth  noting  that  on-line  variants  of  the  EM  algo¬ 
rithm  can  be  derived  (Neal  &  Hinton,  1993,  Titterington, 
1984),  and  this  is  a  further  factor  that  weighs  in  favor 
of  EM  as  compared  to  conjugate  gradient  and  Newton 
methods. 

5  Convergence  rate  comparisons 

In  this  section,  we  provide  a  comparative  theoretical  dis¬ 
cussion  of  the  convergence  rates  of  constrained  gradient 
ascent  and  EM. 

For  gradient  ascent  a  local  convergence  result  can  by 
obtained  by  Taylor  expanding  the  log  likelihood  around 
the  maximum  likelihood  estimate  0“^ .  For  suflficiently 
large  k  we  have: 

Il0(i+i)  -  0*11  <  ||/  +  v^f(0*))||||(e(*^)  -  0*)||  (12) 


and 


Since  H{Q*)  is  negative  definite,  we  obtain 


||/  +  r]H{en\\  <  Am[/  +  vH{en]  =  (13) 

where  H  is  the  Hessian  of  /,  rj  is  the  step  size,  and 
r  =  max{|l  -  r]XM[-H{Q*)]\,  |1  -  7/Am[-F(0*)]|}, 
where  Xm[^]  and  denote  the  largest  and  small¬ 

est  eigenvalues  of  respectively. 

Smaller  values  of  r  correspond  to  faster  convergence 
rates.  To  guarantee  convergence,  we  require  r  <  1  or 
0  <  r?  <  2/Xm[—H{Q^)],  The  minimum  possible  value 
of  r  is  obtained  when  rj  =  1/Xm[H(Q*)]  with 

=  i^Xm[H{e*)]/XM[H{en] 

=  i-K-^[H{e*)], 

where  =  Xm[H]/ Xm[H]  is  the  condition  number  of 
H.  Larger  values  of  the  condition  number  correspond  to 
slower  convergence.  When  k[H]  =  1  we  have  rmin  =  0, 
which  corresponds  to  a  superlinear  rate  of  convergence. 
Indeed,  Newton^s  method  can  be  viewed  as  a  method 
for  obtaining  a  more  desirable  condition  number — the 
inverse  Hessian  balances  the  Hessian  H  such  that 
the  resulting  condition  number  is  one.  Effectively,  New¬ 
ton  can  be  regarded  as  gradient  ascent  on  a  new  func¬ 
tion  with  an  effective  Hessian  that  is  the  identity  matrix: 
Heff  =  =  L  In  practice,  however,  k[H]  is  usually 

quite  large.  The  larger  k[H]  is,  the  more  difficult  it  is  to 
compute  accurately.  Hence  it  is  difficult  to  balance 
the  Hessian  as  desired.  In  addition,  as  we  mentioned 
in  the  previous  section,  the  Hessian  varies  from  point 
to  point  in  the  parameter  space,  and  at  each  iteration 
we  need  recompute  the  inverse  Hessian.  Quasi-Newton 
methods  approximate  by  a  positive  matrix 

5^^)  that  is  easy  to  compute. 

The  discussion  thus  far  has  treated  unconstrained  op¬ 
timization.  In  order  to  compare  gradient  ascent  with 
the  EM  algorithm  on  the  constrained  mixture  estima¬ 
tion  problem,  we  consider  a  gradient  projection  method: 

0(*:  +  l)  =  0W  +  ,,n*^  (14) 

where  Hj^  is  the  projection  matrix  that  projects  the  gra¬ 
dient  into  Vi.  This  gradient  projection  iteration 

will  remain  in  Vi  as  long  as  the  initial  parameter  vector 
is  in  Vi.  To  keep  the  iteration  within  T>2j  we  choose  an 
initial  £  P2  and  keep  rj  sufficiently  small  at  each 
iteration. 

Suppose  that  E  —  [ci,  •  •  '^em]  are  a  set  of  indepen¬ 
dent  unit  basis  vectors  that  span  the  space  T>i.  In  this 
basis,  0^*^  and  become  0^^^  =  and 

aihc  =  respectively,  virith  ll©!*"!  -  e*||  = 

||0(^)  ~  0*||.  In  this  representation  the  projective  gradi¬ 
ent  algorithm  Eq.  (14)  becomes  simple  gradient  ascent: 
0(A:+1)  _  q(A:)  +  V QQ(k)  ♦  Moreover,  Eq.  (12)  becomes 

||0(^+i)_0*||  <  |1^^V  +  ^^(0*))IIII(0^^^-0*)II-  As 
a  result,  the  convergence  rate  is  bounded  by 

r,  =  \\E'^iI  +  r,Hie*))\\ 

<  +  T,Hie*))iI  +  r,H{Q*))TE] 

=  ^Xm[ET(I  +  2r)H{Q*)  +  Ti^m{e*))E]. 


rc  <  \Jl  +  -  2r]X,n[-H^].  (15) 

In  this  equation  =  E'^H{Q)E  is  the  Hessian  of  / 
restricted  to  Vi. 

We  see  from  this  derivation  that  the  convergence 
speed  depends  on  k[Hc]  -  XM[-Hc]/Xm[-Hc].  When 
k[Hc]  =  1,  we  have  y^l  -f  r}^Xlf{~Hl)  ~  2r}Xm[-Hc]  = 
1  —  r}X[—Hc],  which  in  principle  can  be  made  to  equal 
zero  if  rj  is  selected  appropriately.  In  this  case,  a  super- 
linear  rate  is  obtained.  Generally,  however,  k[Hc]  ^  1, 
with  smaller  values  of  k[Hc]  corresponding  to  faster  con¬ 
vergence. 

We  now  turn  to  an  analysis  of  the  EM  algorithm.  As 
we  have  seen  EM  keeps  the  parameter  vector  within  Vi 
automatically.  Thus,  in  the  new  basis  the  connection 
between  EM  and  gradient  ascent  (cf.  Eq.  (10))  becomes 

0(<:  +  l)  _  0(*:)  ^ 
and  we  have 

Il0(fc+i)  -  0*11  <  \\E'^{i  +  PF(e*))||||(0(*=)  -  0*)|| 
with 

r,  =  \\E^{I  +  PH{e*m 

<  +  PHie*)){i  +  PHie*)fE]. 

The  latter  equation  can  be  further  manipulated  to  yield: 
rc  <  +  Xlj[ETpHE]  -  2Xm[-ETpHE].  (16) 

Thus  we  see  that  the  convergence  speed  of  EM  de¬ 
pends  on  k[E^PHE]  =  XM[E'^PHE]/Xm[E^PHEl 
When  k[E^PHE]  =  1,  Xm[E^PHE]  =  1,  we 

have  y/l  -{-  Xl^[E^PHE]  ^  2Xm[-E^PHE]  =(1~ 
Xm[—E'^PHE])  =  0.  In  this  case,  a  superlinear  rate 
is  obtained.  We  discuss  the  possibility  of  obtaining  su¬ 
perlinear  convergence  with  EM  in  more  detail  below. 

These  results  show  that  the  convergence  of  gradient 
ascent  and  EM  both  depend  on  the  shape  of  the  log  likeli¬ 
hood  as  measured  by  the  condition  number.  When 
is  near  one,  the  configuration  is  quite  regular,  and  the 
update  direction  points  directly  to  the  solution  yielding 
fast  convergence.  When  k[H]  is  very  large,  the  I  sur¬ 
face  has  an  elongated  shape,  and  the  search  along  the 
update  direction  is  a  zigzag  path,  making  convergence 
very  slow.  The  key  idea  of  Newton  and  quasi-Newton 
methods  is  to  reshape  the  surface.  The  nearer  it  is  to  a 
ball  shape  (Newton’s  method  achieves  this  shape  in  the 
ideal  case),  the  better  the  convergence.  Quasi-Newton 
methods  aim  to  achieve  an  effective  Hessian  whose  con¬ 
dition  number  is  as  close  as  possible  to  one.  Interest¬ 
ingly,  the  results  that  we  now  present  suggest  that  the 
projection  matrix  P  for  the  EM  algorithm  also  serves 
to  effectively  reshape  the  likelihood  yielding  an  effective 
condition  number  that  tends  to  one.  We  first  present 
empirical  results  that  support  this  suggestion  and  then 
present  a  theoretical  analysis. 


(a)  (b) 

Figure  1:  Experimental  results  for  the  estimation  of  the 
parameters  of  a  two-component  Gaussian  mixture,  (a) 
The  condition  numbers  as  a  function  of  the  iteration 
number,  (b)  A  zoomed  version  of  (a)  after  discarding 
the  first  25  iterations.  The  terminology  ‘original,  con¬ 
strained,  and  EM-equivalent  Hessians’  refers  to  the  ma¬ 
trices  and  E^PHE  respectively. 


We  sampled  1000  points  from  a  simple  finite  mixture 
model  given  by 

p{x)  =  0(ipi{x)  -f  a2P2{x) 


where 


Pi{x)  = 


1  (a;  • 
•exp{--  — 


2  I  ci  ^2 

iraf  I  (Tz 


-}• 


The  parameter  values  were  as  follows;  ai  = 
0.7170,  0^2  =  0.2830,  mi  =  -2,  =  2,  <rj  =  1,  = 

1.  We  ran  both  the  EM  algorithm  and  gradient  ascent 
on  the  data.  At  each  step  of  the  simulation,  we  calcu¬ 
lated  the  condition  number  of  the  Hessian  (0(^))]), 
the  condition  number  determining  the  rate  of  conver¬ 
gence  of  the  gradient  algorithm  {k[E'^ H E]) ,  and 
the  condition  number  determining  the  rate  of  conver¬ 
gence  of  EM  (/c[F;^P(0(*))77(0(*))F;]).  We  also  calcu¬ 
lated  the  largest  eigenvalues  of  the  matrices 
E^H{e(^^)E,  and  E'^ P{e(^))H{e^^^)E.  The  results 
are  shown  in  Fig.  1.  As  can  be  seen  in  Fig.  1(a),  the  con¬ 
dition  numbers  change  rapidly  in  the  vicinity  of  the  25th 
iteration  and  the  corresponding  Hessian  matrices  be¬ 
come  indefinite.  Afterward,  the  Hessians  quickly  become 
definite  and  the  condition  numbers  converge.^  As  shown 
in  Fig.  1(b),  the  condition  numbers  converge  toward  the 
values  k[H{Q^^^)]  =  47.5,  k[E'^ H{Q^^^)E]  =  33.5,  and 
/c[^^P(0(^))i/(0(^))^]  =  3.6.  That  is,  the  matrix  P 
has  greatly  reduced  the  condition  number,  by  factors  of 
9  and  15.  This  significantly  improves  the  shape  of  /  and 
speeds  up  the  convergence. 

We  ran  a  second  experiment  in  which  the  means  of  the 
component  Gaussians  were  mi  =  —  1  and  m2  =  1.  The 
results  are  similar  to  those  shown  in  Fig.  1.  Since  the 
distance  between  two  distributions  is  reduced  into  half, 

^Interestingly,  the  EM  algorithm  converges  soon  afterward 
as  weU,  showing  that  for  this  problem  EM  spends  little  time 
in  the  region  of  parameter  space  in  which  a  local  analysis  is 
valid. 


Figure  2:  Experimental  results  for  the  estimation  of  the 
parameters  of  a  two-component  Gaussian  mixture  (cf. 
Fig.  1).  The  separation  of  the  Gaussians  is  half  the 
separation  in  Fig.  1. 


Figure  3:  The  largest  eigenvalues  of  the  matrices 
H,  E'^HE,  and  E'^PHE  plotted  as  a  function  of  the 
number  of  iterations.  The  plot  in  (a)  is  for  the  experi¬ 
ment  in  Fig.  1;  (b)  is  for  the  experiment  reported  in  Fig. 
2. 


the  shape  of  /  becomes  more  irregular.  The  condition 
number  «:[iir(0^^))]  increases  to  352,  k[E'^ H{Q^^^)E]  in¬ 
creases  to  216,  and  k[E'^ P{Q^^^)H{Q^^'^)E]  increases  to 
61.  We  see  once  again  a  significant  improvement  in  the 
case  of  EM,  by  factors  of  4  and  6. 

Fig.  3  shows  that  the  matrix  P  has  also  reduced  the 
largest  eigenvalues  of  the  Hessian  from  between  2000  to 
3000  to  around  1.  This  demonstrates  clearly  the  sta¬ 
ble  convergence  that  is  obtained  via  EM,  without  a  line 
search  or  the  need  for  external  selection  of  a  learning 
stepsize. 

In  the  remainder  of  the  paper  we  provide  some  theo¬ 
retical  analyses  that  attempt  to  shed  some  light  on  these 
empirical  results.  To  illustrate  the  issues  involved,  con¬ 
sider  a  degenerate  mixture  problem  in  which  the  mixture 
has  a  single  component.  (In  this  case  ai  =  1.)  Let  us  fur¬ 
thermore  assume  that  the  covariance  matrix  is  fixed  (i.e., 
only  the  mean  vector  m  is  to  be  estimated).  The  Hes¬ 
sian  with  respect  to  the  mean  m  is  77  =  — and  the 
EM  projection  matrix  P  is  E/N.  For  gradient  ciscent,  we 
have  k[E'^HE]  —  which  is  larger  than  one  when¬ 

ever  E  ^  cl.  EM,  on  the  other  hand,  achieves  a  condi¬ 
tion  number  of  one  exactly  {k[E'^ PHE]  =  ^[PH]  = 
k[I]  =  1  and  Xm[E^PHE]  =  1).  Thus,  EM  and  New- 
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ton’s  method  are  the  same  for  this  simple  quadratic 
problem.  For  general  non-quadratic  optimization  prob¬ 
lems,  Newton  retains  the  quadratic  assumption,  yield¬ 
ing  fast  convergence  but  possible  divergence.  EM  is 
a  more  conservative  algorithm  that  retains  the  conver¬ 
gence  guarantee  but  also  maintains  quasi- Newton  be¬ 
havior.  We  now  analyze  this  behavior  in  more  detail. 
We  consider  the  special  case  of  estimating  the  means  in 
a  Gaussian  mixture  when  the  Gaussians  are  well  sepa¬ 
rated. 

Theorem  2  Consider  the  EM  algorithm  in  Eq.  (3), 
where  the  parameters  aj  and  Ej  are  assumed  to  be 
known.  Assume  that  the  K  Gaussian  distributions  are 
well  separated,  such  that  for  sufficiently  large  k  the  pos¬ 
terior  probabilities  h!j^\t)  are  nearly  zero  or  one.  For 
such  k,  the  condition  number  associated  with  EM  is  al¬ 
ways  smaller  than  the  condition  number  associated  with 
gradient  ascent.  That  is: 

k[E'^ <  K[E'^H{e^'‘'>)E]. 

Furthermore,  Xm[E'^ P(Q^'‘^)H{Q^'’^)E]  approaches  one 
as  k  goes  to  infinity. 


Proof.  The  Hessian  is 


H  = 


r^fii 

Hi2  ■ 

H21 

H22  ■ 

••  H2K 

.Hki 

Hk2  ■ 

■  •  Hkk . 

where 


Hij  = 


d^l 


drriidmj 


(17) 


(18) 


=  -(Ef  hhf\t)  +  (sf  V 
«  =  1 

with  =  (fiij  —  h!f\t))h^l^\i) .  The  projection 

matrix  P  is 


where 


pW  =  diag[Pi« 


p(^)  __  _ 

2^tzzl 


Given  that  is  negligible  for  sufR- 

ciently  large  k,  the  second  terra  in  Eq.  (19)  can  be 
neglected,  yielding  Hu  =  and 

H  =  diag[i7ii,  •  •  • ,  Hkk]-  This  implies  that  PH  =  —I, 
and  thus  k[PH]  =  1,  whereas  k[H]  7^  1.  □ 


This  theorem,  although  restrictive  in  its  assumptions, 
gives  some  indication  as  to  why  the  projection  matrix 
in  the  EM  algorithm  appears  to  condition  the  Hessian, 
yielding  improved  convergence.  In  fact,  we  conjecture 


that  the  theorem  can  be  extended  to  apply  more  widely, 
in  particular  to  the  case  of  the  full  EM  update  in  which 
the  mixing  proportions  and  covariances  are  estimated, 
and  also,  within  limits,  to  cases  in  which  the  means  are 
not  well  separated.  To  obtain  an  initial  indication  as  to 
possible  conditions  that  can  be  usefully  imposed  on  the 
separation  of  the  mixture  components,  we  have  stud¬ 
ied  the  case  in  which  the  second  term  in  Eq.  (19)  is 
neglected  only  for  Hu  and  is  retained  for  the  Hij  com¬ 
ponents,  where  j  ^  i.  Consider,  for  example,  the  case 
of  a  univariate  mixture  having  two  mixture  components. 
For  fixed  mixing  proportions  and  fixed  covariances,  the 
Hessian  matrix  (Eq.  17)  becomes: 


1^12 

^22 


and  the  projection  matrix  (Eq.  19)  becomes: 


P  = 


•Uil  u 

^  ■~^22^ 


where 


and 


N 


2(ib) 

t=l 


^  2(k)  2(t) 

tzzl 


for  i  zjz  j  ~  lj2.  If  H  is  negative  definite,  (i.e.,  /iu^22  — 
^12^21  <  0),  then  we  can  show  that  the  conclusions  of 
Theorem  2  remain  true,  even  for  Gaussians  that  are  not 
necessarily  well-separated.  The  proof  is  achieved  via  the 
following  lemma: 

Lemma  1  Consider  the  positive  definite  matrix 


E  = 


<^21 


<^12 

^^22 


For  the  diagonal  matrix  B  =  diag[cr^]L^,  C722^]; 
k[5E]  <  k[E]. 


Proof.  The  eigenvalues  of  E  are  the  roots  of  (cth  — 
'^)(<^22  —  A)  —  <r2icri2  =  0,  which  gives 

—  ^11  +  cr22  +  7 
2 

_  +  ^22  T 

2 _ 

=  \/(u'ii  -h  <722)^  —  4((7n(722  •“  <^21^12) 

=  ^ll±i5l±2 

+  cr22  -  7 

The  condition  number  /c[E]  can  be  written  as  /c[E]  = 
(1  +  5)/(l  —  s)  =  /(s),  where  s  is  defined  as  follows: 

1  -  "^(^11^22  —  <^2lO'l2) 

{(Til  +  <^22)^ 


Am 

Am 

7 


and 
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Furthermore,  the  eigenvalues  of  5E  are  the  roots 
of  (1  -  A)(l  -  A)  -  (<Toicri2)/(crii(j22)  =  0,  which 
gives  Am  =  1  +  v/(p-2i<ri2)/(o'iiO'22)  and  = 

1  -  \/(<T2i<Ti2)/(g-iig'227-  Thus,  defining  r  = 
v/(o-2iO'i2)/(o'ii<7'22),  we  have  k[BE]  =  (l  +  r)/(l-r)  = 

fir)- 

We  now  examine  the  quotient  s/r: 

s  _  I  r  4(1  —  r^) 

r  “  r  y  (cTii  +  cr22)^/(crncr22) 

Given  that  {cru  +  (J22)^/(criicr22)  >  4,  we  have  ~  > 
—  (1  —  r^)  =  1.  That  is,  s  >  r.  Since  /(a?)  = 
(l  +  a?)/(l  —  x)  is  a  monotonically  increasing  function  for 
X  >  0,  we  have  f{s)  >  /(r).  Therefore,  «:[BE]  <  /c[E]. 
□ 

We  think  that  it  should  be  possible  to  generalize 
this  lemma  beyond  the  univariate,  two-component  case, 
thereby  weakening  the  conditions  on  separability  in  The¬ 
orem  2  in  a  more  general  setting. 

6  Conclusions 

In  this  paper  we  have  provided  a  comparative  analysis 
of  algorithms  for  the  learning  of  Gaussian  mixtures.  We 
have  focused  on  the  EM  algorithm  and  have  forged  a  link 
between  EM  and  gradient  methods  via  the  projection 
matrix  P.  We  have  also  analyzed  the  convergence  of 
EM  in  terms  of  properties  of  the  matrix  P  and  the  effect 
that  P  has  on  the  likelihood  surface. 

EM  has  a  number  of  properties  that  make  it  a  par¬ 
ticularly  attractive  algorithm  for  mixture  models.  It  en¬ 
joys  automatic  satisfaction  of  probabilistic  constraints, 
monotonic  convergence  without  the  need  to  set  a  learn¬ 
ing  rate,  and  low  computational  overhead.  Although  EM 
has  the  reputation  of  being  a  slow  algorithm,  we  feel 
that  in  the  mixture  setting  the  slowness  of  EM  has  been 
overstated.  Although  EM  can  indeed  converge  slowly 
for  problems  in  which  the  mixture  components  are  not 
well  separated,  the  Hessian  is  poorly  conditioned  for 
such  problems  and  thus  other  gradient-based  algorithms 
(including  Newton’s  method)  are  also  likely  to  perform 
poorly.  Moreover,  if  one’s  concern  is  convergence  in  like¬ 
lihood,  then  EM  generally  performs  well  even  for  these 
ill-conditioned  problems.  Indeed  the  algorithm  provides 
a  certain  amount  of  safety  in  such  cases,  despite  the  poor 
conditioning.  It  is  also  important  to  emphasize  that 
the  case  of  poorly  separated  mixture  components  can 
be  viewed  as  a  problem  in  model  selection  (too  many 
mixture  components  are  being  included  in  the  model), 
and  should  be  handled  by  regularization  techniques. 

The  fact  that  EM  is  a  first  order  algorithm  certainly 
implies  that  EM  is  no  panacea,  but  does  not  imply  that 
EM  has  no  advantages  over  gradient  ascent  or  superlin- 
ear  methods.  First,  it  is  important  to  appreciate  that 
convergence  rate  results  are  generally  obtained  for  un¬ 
constrained  optimization,  and  are  not  necessarily  indica¬ 
tive  of  performance  on  constrained  optimization  prob¬ 
lems.  Also,  as  we  have  demonstrated,  there  are  condi¬ 
tions  under  which  the  condition  number  of  the  effective 


Hessian  of  the  EM  algorithm  tends  toward  one,  showing 
that  EM  can  approximate  a  superlinear  method.  Finally, 
in  cases  of  a  poorly  conditioned  Hessian,  superlinear  con¬ 
vergence  is  not  necessarily  a  virtue.  In  such  cases  many 
optimization  schemes,  including  EM,  essentially  revert 
to  gradient  ascent. 

We  feel  that  EM  will  continue  to  play  an  important 
role  in  the  development  of  learning  systems  that  empha¬ 
size  the  predictive  aspect  of  data  modeling.  EM  has  in¬ 
deed  played  a  critical  role  in  the  development  of  hidden 
Markov  models  (HMM’s),  an  important  example  of  pre¬ 
dictive  data  modeling.^  EM  generally  converges  rapidly 
in  this  setting.  Similarly,  in  the  case  of  hierarchical  mix¬ 
tures  of  experts  the  empirical  results  on  convergence  in 
likelihood  have  been  quite  promising  (Jordan  &  Jacobs, 
1994;  Waterhouse  k  Robinson,  1994).  Finally,  EM  can 
play  an  important  conceptual  role  as  an  organizing  prin¬ 
ciple  in  the  design  of  learning  algorithms.  Its  role  in  this 
case  is  to  focus  attention  on  the  “missing  variables”  in 
the  problem.  This  clarifies  the  structure  of  the  algorithm 
and  invites  comparisons  with  statistical  physics,  where 
missing  variables  often  provide  a  powerful  analytic  tool. 
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